Deep Learning Math  ·  The Lab

The Tiny Transformer Lab

a real, miniature GPT — with attention you can watch

The real architecture behind ChatGPT, shrunk until it fits in your browser: token + positional embeddings, a residual stream, stacked blocks of multi-head self-attention + a GELU MLP, LayerNorm, and a softmax head — trained live by genuine, gradient-checked backprop.

this build: {{ nLayers }} layers · {{ nHeads }} heads · {{ dDim }} dims · ctx {{ T }} · ~{{ paramCount }} weights. same shape as GPT-2, just smaller — so it babbles, but with real attention. the model lives in transformer-engine.js, a standalone file you can open and edit.

one transformer block, the way GPT builds it
Embed +Pos
Ch.4
LayerNorm
normalize
Multi-Attn
Ch.4+6
MLP+GELU
Ch.5
Unembed
Ch.6
Backprop
Ch.2
residual stream runs straight through — every block reads it, adds to it, passes it on · ×{{ nLayers }} blocks
01

Train the transformer

Backprop through {{ nLayers }} blocks of real attention. The loss — surprise at the true next letter — ticks down.

step {{ step }} loss {{ lossDisp }}
learn rate {{ lrDisp }}
{{ nLayers }} layers · {{ nHeads }} heads · {{ dDim }} dims · vocab {{ V }} · ctx {{ T }}
optimizer Adam · trained on {{ wordsN }} names · engine gradient-checked.
02

Watch attention

Each head learns its own pattern. Pick a layer and head, type a word, and watch every letter look back. Train and see them sharpen.

layer
head

row = letter looking · column = letter looked at · lower triangle only: no peeking ahead.

03

Generate

Feed the model its own output, using the full context each step. Train it first; fresh weights spit noise.

temp {{ tempDisp }}
{{ s }}
04

The trick behind attention

“Look at the past” is just multiplying by a matrix. Step through how a plain average becomes learned attention — same shape, smarter weights.

{{ trickNote }}

From this to GPT-2 is only a matter of more.

Same architecture, every dial turned up — this lab already has the real pieces (multi-head, residual stream, LayerNorm, GELU).

layers
{{ nLayers }} → 12
heads
{{ nHeads }} → 12
dimensions
{{ dDim }} → 768
context
{{ T }} → 1024
vocabulary
{{ V }} → 50,257
weights
~{{ paramCount }} → 124M

Then train on the internet instead of names. The babble becomes language — but the dot products, softmax, residual stream, and gradient steps are exactly what you're running now.